2023-09-25 · AIbase
Investigation into the Chaos of Large Model Evaluation: Parameter Scale Isn't Everything
Parameter scale is not the only criterion for assessing large models: differences in evaluation sets can produce significant ranking swings, and raising the proportion of subjective questions also shifts rankings, raising questions about evaluation fairness. Third-party assessment organizations such as OpenCompass and FlagEval are gaining attention, and the academic community argues that dimensions such as model robustness and safety should also be considered. A truly comprehensive and effective evaluation method is still being explored.